2 Jun 2025
I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
These materials are generated by Gerko Vink, who holds the copyright. The intellectual property belongs to Utrecht University. Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.
Warning
You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:
Materials
RStudio - Integrated Development Environment (IDE) for R
1. Code Editing: RStudio provides a code editor with syntax highlighting, autocompletion, and error checking, making your coding process more efficient.
2. Console: An interactive R console allows you to execute R code line by line and view results in real time.
3. Environment Pane: Keep track of your variables, data frames, and functions with the environment pane.
4. Plots and Visualizations: Create and view plots, charts, and visualizations within RStudio.
5. Integrated Help: Access R documentation, packages, and online resources directly from the IDE.
6. Version Control: Easily integrate R projects with version control systems like Git.
7. Markdown Support: RStudio seamlessly integrates with Markdown, making it an ideal choice for creating reproducible reports and documents.
It plays a crucial role in promoting reproducibility and collaboration in data science and statistical analysis.
RStudio: Integrated Development Environment (IDE)
RStudio Projects
Every time you start a new (data analysis) project, make it a habit to create a new RStudio Project.
Because you want your project to work:
RStudio Projects create a convention that guarantees that the project can be moved around on your computer or onto other computers and will still “just work”.
RStudio project
Every time you want to work on this project: open the project by clicking the .Rproj file.
Markdown is a lightweight markup language for creating formatted text using plain text. It’s easy to learn and widely used in various applications.
GitHub-Flavored Markdown (GFM) is a variant of Markdown used on GitHub (next week), enhancing its capabilities for documentation and collaboration.
RMarkdown is an extension of Markdown that allows you to embed R code and its output directly within a document.
Quarto is a comprehensive tool for creating reproducible and collaborative data science documents.
Jupyter Notebooks: Widely used interactive kernel-based computing environment for data science and machine learning, supporting multiple (i.e. almost all) programming and scripting languages.
RMarkdown: An R-based notebook environment that combines code, output, and narrative text in a single document.
YAML (YAML Ain’t Markup Language) is a human-readable data serialization format commonly used for configuration files and metadata in various programming and markup contexts.
YAML is very simple and readable
In Quarto and many other applications, YAML is used to specify:
Document Metadata: Information about the document itself, such as the title, author, date, and document type.
Document Configuration: Settings related to the document’s behavior, appearance, and rendering, such as the output format (e.g., HTML, PDF), document template, and style options.
Custom Variables: Definitions of custom variables or parameters that can be used throughout the document to control behavior or content.
Here’s an example of a simple YAML header in a Quarto document:
---
title: "All flavors markdown"
author:
  - name: Gerko Vink
    orcid: 0000-0001-9767-1924
    email: g.vink@uu.nl
    affiliations:
      - name: Methodology & Statistics @ UU University
  - name: Hanne Oberman
    orcid: 0000-0003-3276-2141
    email: h.i.oberman@uu.nl
    affiliations:
      - name: Methodology & Statistics @ UU
date: 25 Sep 2024
date-format: "D MMM YYYY"
bibliography: data/lec-2/publications.bib
execute:
  echo: true
editor: source
format:
  revealjs:
    embed-resources: true
    theme: [solarized, gerko.scss]
    progress: true
    multiplex: true
    transition: fade
    slide-number: true
    margin: 0.075
    logo: "images/logo.png"
    toc: false
    toc-depth: 1
    toc-title: Outline
    scrollable: true
    reference-location: margin
    footer: Gerko Vink and Hanne Oberman - Markup Languages @ UU
---
In this example:
- title, author, and date provide metadata about the document.
- format specifies settings related to the document’s output format and theme.
The YAML header is a powerful tool for customizing and configuring Quarto documents, allowing you to control how the document is rendered and presented. It ensures that important document information and settings are stored in a human-readable and structured format at the beginning of the document.
Quarto
Text is text. Nothing more, nothing less
# This is a heading indicating a section
## This is a heading indicating a subsection
### This is a heading indicating a subsubsection
But in the above I used
Why?
It is that simple. No more framing in \(\LaTeX\) or other stuff. Just use the # and ## to denote a section and a slide.
Scientific references are in the footer. Opinions and figures are my own, AI-generated or directly linked.
What is statistical programming?
Broadly speaking:
Computer programming is more focused on software development.
Statistical programming is more focused on data analysis and the communication of the results.
In this course we focus more on the HOW of doing data analysis in R. It is not primarily a course in statistics.
The most important skill for a programmer is writing usable code: code that other people can use, understand and develop further.
So far, we have learned the basics of programming in R:
- <- (alt/option -)
- RStudio and R Markdown
- RStudio
When you start RStudio and R, only the base packages are activated: the basic installation with basic functionality.
Use sessionInfo() to see which packages are active. This is what the basic installation looks like:
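A minimal sketch of inspecting the session. The exact list depends on your R version and on what has already been loaded, so no output is shown here.

```r
# Collect information about the running R session
info <- sessionInfo()

# Base packages that are attached by default
info$basePkgs

# Packages attached on top of the base installation
# (NULL in a fresh session)
names(info$otherPkgs)
```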
Packages are like apps on your mobile phone.
The easiest way to install a package, e.g. mice, is to use:
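The plain call is install.packages("mice"). The sketch below wraps that call in a check, so that it only downloads when the package is missing. The interactive() guard is an assumption of this sketch, not part of the original slides.

```r
pkg <- "mice"

# TRUE when the package is already present in one of the library folders
installed <- pkg %in% rownames(installed.packages())

# Assumption for this sketch: only install from an interactive session,
# so that running a script never triggers an unexpected download
if (!installed && interactive()) {
  install.packages(pkg)
}
```

Installing is needed only once per R installation; loading is needed in every new session.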
Alternatively, you can also do it in RStudio through:
Tools -> Install Packages
For an overview of the packages you have installed, see the “Packages” tab in the output pane:
There are two ways to load a package in R: library() and require().
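A small sketch of both loading functions. The stats package ships with every R installation, so it is used here as a safe example. The name notARealPackage is made up, purely to show what each function does when a package is missing.

```r
# Both calls attach an installed package to the search path
library(stats)
require(stats)

# require() returns FALSE (with a warning) and the script carries on
res_require <- suppressWarnings(
  require("notARealPackage", character.only = TRUE)
)
res_require

# library() throws an error; tryCatch() catches it here so that
# this example can run to the end
res_library <- tryCatch(
  library("notARealPackage", character.only = TRUE),
  error = function(e) "error"
)
res_library
```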
When a package is not found (not installed):
- require() will produce a warning but will continue to run the rest of the code.
- library() will produce an error and stop running the rest of the code.
Everything that is published on the Comprehensive R Archive Network (CRAN) and is aimed at R users must be accompanied by a help file.
In the search bar of the output pane:
In the console:
- help(sample) or ?sample (opens a help window).
- help(package=mice) for packages
- Type sample in the console or editor (Markdown code chunk): a pop-up window appears with help about the structure of the function.
Type your search term in the search bar of the output pane.
In the console:
- ?? followed by your search term.
- ??anova returns a list of all help pages that contain the word ‘anova’.
Some packages have cheat sheets; in RStudio, see Help menu -> Cheat Sheets.
Google the search term(s) and add ‘R’ as keyword.
Helpful websites: http://www.stackoverflow.com and http://www.stackexchange.com
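The console-based help lookups described above can be sketched as follows. The results are stored in objects here so that the script runs without opening viewers. At the console you would simply type the call, or its ? and ?? shorthand. Restricting the search to the stats package is a choice made for speed in this sketch.

```r
# Help page for one function; ?sample is shorthand for the same call
h <- help("sample")
class(h)

# Overview of a whole package
pkg_help <- help(package = "stats")

# Search the installed help pages; ??anova is shorthand for
# help.search("anova") across all installed packages
hits <- help.search("anova", package = "stats")
class(hits)
```

Printing h, pkg_help or hits opens the corresponding help view.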
Functions are the building blocks of R
Built-in or user-defined (programme your own functions).
To use a function, type the function name with parentheses: mean()
Typing the name of the function without the parentheses reveals the code of the function.
Every function in R has the following structure:
Image source: Garrett Grolemund, Hands-On Programming with R, 2.6
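As a sketch of that structure, here is a small user-defined function with a name, arguments with default values, a body and a return value. The function roll() and its argument names are hypothetical, chosen only for illustration.

```r
# name <- function(arguments) { body }
roll <- function(sides = 6, n = 2) {
  dice <- sample(x = 1:sides, size = n, replace = TRUE)
  sum(dice)   # the last evaluated expression is returned
}

roll()                    # with parentheses: runs the function
roll(sides = 20, n = 1)   # override the default values
roll                      # without parentheses: prints the code
formals(roll)             # the arguments and their defaults
```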
When you want to use a function in R, you need to know which information you need to provide to the function.
For example the function sample()
Use args(<function name>) to obtain info about the arguments and the default values:
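For sample(), this looks roughly as follows. formals() is shown as an alternative that returns the same information as a named list.

```r
args(sample)
# function (x, size, replace = FALSE, prob = NULL)
# NULL

# The argument names as a character vector
arg_names <- names(formals(sample))
arg_names
```

Arguments without a default (x and size) have to be considered by the user; replace and prob fall back to FALSE and NULL.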
Or make use of the pop-up help and use the TAB key to cycle through the arguments:
Pressing F1 opens the help file of the function sample():
Now we can use the function to, for example, mimic the sampling of two dice.
x represents the items to sample from (the range of possible items). In this case the numbers 1 to 6 (the pips on a single die).
size is the number of items to choose, in this case 2
replace=TRUE means sampling with replacement
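Putting the three arguments together gives a sketch like the one below. The seed is arbitrary and only makes the draw repeatable.

```r
set.seed(2025)  # arbitrary seed for reproducibility

# Throw two dice: draw 2 values from 1 to 6, with replacement
dice <- sample(x = 1:6, size = 2, replace = TRUE)
dice
sum(dice)
```

Without replace = TRUE the two dice could never show the same number.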
Will the function work if we leave out the argument names and give only the values?
And if we change the order of the values?
Changing the order is possible only when the arguments are named.
Recommendation: type out the arguments and their values. This prevents errors and increases the readability of your code.
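A short sketch of these three situations. The seed is reset before every call so that the draws can be compared.

```r
set.seed(123)
named <- sample(x = 1:6, size = 2, replace = TRUE)

# Values only, in the order of the function definition: same result
set.seed(123)
positional <- sample(1:6, 2, TRUE)

# Different order, but every argument is named: same result
set.seed(123)
reordered <- sample(replace = TRUE, size = 2, x = 1:6)

identical(named, positional)  # TRUE
identical(named, reordered)   # TRUE

# Not run: sample(2, TRUE, 1:6) would be matched by position as
# x = 2, size = TRUE, replace = 1:6, which is not what was intended
```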
A vector is an indexed set of values (a list of numbers) and has one dimension (row vector or column vector). The simplest vector has 1 element.
c() combines values into a vector:
Vectors can have the following atomic data types: integer, numeric (double), character, logical, complex
Numeric (double):
Integer:
Character:
[1] "u" "v" "w" "x" "y" "z"
[1] "Mike" "Anne" "George"
Logical:
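The four most common types can be sketched as follows. The object names are made up for this example.

```r
num <- c(1.5, 2, 3.25)               # numeric (double)
int <- c(25L, 26L, 27L)              # integer: note the L suffix
chr <- c("Mike", "Anne", "George")   # character
lgl <- c(TRUE, FALSE, TRUE)          # logical

typeof(num)   # "double"
typeof(int)   # "integer"
typeof(chr)   # "character"
typeof(lgl)   # "logical"

# The built-in vector letters holds the lowercase alphabet
letters[21:26]   # "u" "v" "w" "x" "y" "z"
```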
With c()
Simple replication with rep()
Or more complex:
[1] "A" "A" "B" "B" "B"
[1] "A" "A" "A" "B" "B" "B"
Sequence of numbers with seq()
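A sketch of rep() and seq(). The two character results match the output shown above; the numeric examples are additional illustrations.

```r
r1 <- rep(1:3, times = 2)                  # 1 2 3 1 2 3
r2 <- rep(c("A", "B"), times = c(2, 3))    # "A" "A" "B" "B" "B"
r3 <- rep(c("A", "B"), each = 3)           # "A" "A" "A" "B" "B" "B"

s1 <- seq(from = 1, to = 10, by = 2)       # 1 3 5 7 9
s2 <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00
```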
matrix() creates arrays with specified dimensions, e.g. vectors:
A matrix:
Vectors and matrices can only hold one data type. Remember, matrices and vectors are numerical OR character objects. They can never contain both and still be used for numerical calculations.
Vectors and matrices can only hold one data type
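A small sketch of both points: creating a matrix, and what happens when numbers and characters are combined in one matrix. The object names are hypothetical.

```r
# A numeric matrix with 2 rows and 3 columns, filled column by column
m_num <- matrix(1:6, nrow = 2, ncol = 3)
m_num
dim(m_num)   # 2 3

# Binding numbers and letters together converts everything to character
m_mix <- cbind(25:30, letters[21:26])
typeof(m_mix)   # "character"
m_mix[1, 1]     # "25": text, no longer a number
```

Because the numbers in m_mix are now text, they cannot be used in numerical calculations without converting them back.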
Lists are flexible data structures: the elements in a list may be a combination of different data types (numeric, character) and dimensions.
Assign names to the elements of a list with names(). Notice the $.
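A sketch of a list that combines a character vector, an integer vector and a matrix. The element names follow the list output shown later in these slides; the object name l is an assumption.

```r
l <- list(
  c("Mike", "Anne", "George"),
  25:30,
  cbind(25:30, letters[21:26])
)

# Give every element a name
names(l) <- c("Names", "Numbers", "Matrix")

# Each element is now printed under its $name
l
length(l)   # 3
```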
A data frame is the R representation of a rectangular data set where the rows are the observations and the columns the variables.
Data frames can contain both numerical and character column vectors at the same time, although never in the same column.
V1 V2 V3
1 0.1292877 4.108676 a
2 1.7150650 7.448164 b
3 0.4609162 5.719628 c
4 -1.2650612 5.801543 d
5 -0.6868529 5.221365 e
We ‘filled’ a data frame with two randomly generated sets from the normal distribution - where \(V1\) is standard normal and \(V2 \sim N(5,2)\) - and a character set.
You can name the columns in a data frame with names() and the rows with row.names():
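A sketch of how such a data frame could be built and named. Two assumptions are made here: the seed is arbitrary, so the values differ from those printed above, and the 2 in \(N(5,2)\) is read as the variance, hence sd = sqrt(2). Use sd = 2 if it denotes the standard deviation.

```r
set.seed(123)  # arbitrary seed

D <- data.frame(
  V1 = rnorm(5),                          # standard normal
  V2 = rnorm(5, mean = 5, sd = sqrt(2)),  # assumption: variance 2
  V3 = letters[1:5]                       # character column
)

# Name the rows; the column names were set when creating D
row.names(D) <- paste("row", 1:5)
names(D)
D
```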
Factors are used to represent categorical data (ordered or unordered).
A factor is a vector with integers where each integer has a label.
Factors facilitate interpretation of results in statistical modeling: a variable with labels “male”, “female” is self-describing compared to a variable with values 1, 2.
Factors are very useful in statistical modeling (linear models, GLM) where they facilitate the dummy coding process of categorical variables.
Factor objects can be created with the factor() function.
[1] male male female male female
Levels: female male
Obtain the summary of the factor:
Factors are integer vectors where each integer has a label (levels):
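A sketch that reproduces the factor printed above and shows the integers behind the labels. The object name gender is an assumption.

```r
gender <- factor(c("male", "male", "female", "male", "female"))
gender

summary(gender)      # counts per level: female 2, male 3
levels(gender)       # "female" "male" (alphabetical by default)
as.integer(gender)   # 2 2 1 2 1
```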
In the basic installation of R (“base R”) there are three ways to select elements from vectors, matrices, lists and data frames:
[]
[[]]
$
[]
Square brackets [] are used to call single elements or entire rows and columns.
[a, b]: a refers to the row number(s), b refers to the column number(s).
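A short sketch with a vector and a matrix; both objects are made up for this example.

```r
v <- 11:16
v[2]          # 12
v[c(1, 3)]    # 11 13

m <- matrix(1:6, nrow = 2, ncol = 3)
m[2, 3]       # element in row 2, column 3: 6
m[1, ]        # entire first row: 1 3 5
m[, 2]        # entire second column: 3 4
```

Leaving a or b empty selects all rows or all columns.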
Also for data frames:
V1 V2 V3
row 1 0.1292877 4.108676 a
row 2 1.7150650 7.448164 b
row 3 0.4609162 5.719628 c
row 4 -1.2650612 5.801543 d
row 5 -0.6868529 5.221365 e
[1] 7.448164 5.719628
V2 V3
row 1 4.108676 a
V1 V2 V3
row 1 0.1292877 4.108676 a
row 2 1.7150650 7.448164 b
row 3 0.4609162 5.719628 c
row 4 -1.2650612 5.801543 d
row 5 -0.6868529 5.221365 e
V1 V2
row 1 0.1292877 4.108676
row 2 1.7150650 7.448164
row 3 0.4609162 5.719628
row 4 -1.2650612 5.801543
row 5 -0.6868529 5.221365
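The selections behind the output above can be sketched as follows. The data frame is rebuilt from the printed values, so the object name D and the exact calls are a reconstruction rather than the original code.

```r
D <- data.frame(
  V1 = c(0.1292877, 1.7150650, 0.4609162, -1.2650612, -0.6868529),
  V2 = c(4.108676, 7.448164, 5.719628, 5.801543, 5.221365),
  V3 = letters[1:5],
  row.names = paste("row", 1:5)
)

D[2:3, 2]    # rows 2 and 3 of column 2: 7.448164 5.719628
D[1, 2:3]    # row 1, columns V2 and V3
D[, 1:2]     # all rows, first two columns
```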
[[]]
The [[]] operator selects only one element
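A sketch of the difference between [] and [[]] on a list. The list is a shortened, hypothetical version of the one used in these slides.

```r
l <- list(Names = c("Mike", "Anne", "George"), Numbers = 25:30)

l[1]          # single brackets: still a list, with one element
class(l[1])   # "list"

l[[1]]        # double brackets: the element itself
class(l[[1]]) # "character"

l[[2]][3]     # third value of the second element: 27
```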
$
Use $ to select elements with name labels in lists or data frames:
$Names
[1] "Mike" "Anne" "George"
$Numbers
[1] 25 26 27 28 29 30
$Matrix
[,1] [,2]
[1,] "25" "u"
[2,] "26" "v"
[3,] "27" "w"
[4,] "28" "x"
[5,] "29" "y"
[6,] "30" "z"
Use $ to select a variable in a data frame:
V1 V2 V3
row 1 0.1292877 4.108676 a
row 2 1.7150650 7.448164 b
row 3 0.4609162 5.719628 c
row 4 -1.2650612 5.801543 d
row 5 -0.6868529 5.221365 e
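A sketch of $ on both a list and a data frame. The objects are rebuilt here from the values printed in these slides, so their names are assumptions.

```r
l <- list(Names = c("Mike", "Anne", "George"), Numbers = 25:30)
l$Names       # "Mike" "Anne" "George"
l$Numbers     # 25 26 27 28 29 30

D <- data.frame(
  V1 = c(0.1292877, 1.7150650, 0.4609162, -1.2650612, -0.6868529),
  V3 = letters[1:5]
)
D$V3          # the whole column as a vector
D$V1[2]       # second value of V1: 1.715065
```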
Logical operators are signs that evaluate a statement, such as ==, <, >, <=, >=, and | (OR) as well as & (AND). Typing ! before a logical operator takes the complement of that action.
For example, if we would like to select elements of vector v that are larger than 6, we would type:
[1] 1 2 3 4 5 6 7 8 9 10 11 12
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
The column values for TRUE may be of different length. A vector as a return is therefore more appropriate. The TRUE and FALSE values serve as indicators to select the elements in v larger than 6.
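The selection described above can be sketched as follows, with two extra illustrations of & and !.

```r
v <- 1:12

v > 6                 # logical vector: FALSE for 1 to 6, TRUE for 7 to 12
v[v > 6]              # 7 8 9 10 11 12

v[v > 6 & v <= 10]    # both conditions must hold: 7 8 9 10
v[!(v > 6)]           # the complement: 1 2 3 4 5 6
```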
| Symbol | Meaning |
|---|---|
| ! | logical not |
| & | logical and |
| \| | logical or |
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | logical equals |
| != | not equal |
In R there are two types of numbers: integers and floating point numbers. Since computer memory is limited, you cannot store numbers with infinite precision. Numbers are therefore represented with floating point numbers. Floating points cannot represent decimal fractions exactly in most cases.
Why does R tell us that 3 - 2.9 ≠ 0.1?
Let’s have a look at how the decimal fractions are actually represented as floating points. You can see this by asking for a representation with 54 decimals.
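One way to ask for such a representation is sprintf(); whether the original slides used this function is an assumption. The printed digits are not reproduced here, as the trailing digits can depend on the platform.

```r
# The comparison fails although it is true mathematically
is_equal <- (3 - 2.9) == 0.1
is_equal   # FALSE

# Show both numbers with 54 decimals
sprintf("%.54f", 0.1)
sprintf("%.54f", 3 - 2.9)
```

Neither number is stored as exactly 0.1, and the two stored values differ from each other.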
The difference of 8.326673e-17 is far smaller than any difference that matters in practice; it arises because 2.9 and 0.1 cannot be represented exactly as floating point numbers.
The machine epsilon in R (.Machine$double.eps), the smallest positive floating point number x such that 1 + x != 1, is: 2.220446e-16
You can verify whether the difference between two floating points is smaller than the machine epsilon (2.220446e-16).
Or use the all.equal() function, which checks that the difference is within a small tolerance (by default about 1.5e-8, the square root of the machine epsilon).
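Both checks can be sketched as follows. The object name d is made up for this example.

```r
d <- (3 - 2.9) - 0.1
d                                  # 8.326673e-17

# Compare the absolute difference with the machine epsilon
.Machine$double.eps                # 2.220446e-16
abs(d) < .Machine$double.eps       # TRUE

# all.equal() compares within a tolerance; wrap it in isTRUE()
# because it returns a description instead of FALSE on a mismatch
isTRUE(all.equal(3 - 2.9, 0.1))    # TRUE
```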
Gerko Vink @ Anton de Kom Universiteit, Paramaribo